We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously in high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion is designed around a sequential multi-modal U-Net for a joint denoising process. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging the two subnets, which enables efficient cross-modal alignment and thus mutually reinforces audio and video fidelity. Extensive experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
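As a rough illustration of the random-shift idea, the sketch below (a minimal PyTorch sketch with hypothetical names and shapes, not the official MM-Diffusion code) lets each video token attend only to a small, randomly shifted window of audio tokens, which keeps cross-modal attention cheap while still coupling the two subnets.

```python
# Hypothetical sketch of a random-shift cross-modal attention step (PyTorch).
# Names and shapes are illustrative, not the official MM-Diffusion implementation.
import torch

def random_shift_cross_attention(video_tokens, audio_tokens, window=4):
    """video_tokens: (B, Tv, C), audio_tokens: (B, Ta, C).
    Each video token attends only to a small, randomly shifted audio window,
    so the cost is O(Tv * window) instead of O(Tv * Ta)."""
    B, Tv, C = video_tokens.shape
    Ta = audio_tokens.shape[1]
    shift = torch.randint(0, Ta, (1,)).item()      # random shift, resampled each call
    idx = (shift + torch.arange(window)) % Ta      # wrapped window of audio positions
    keys = audio_tokens[:, idx, :]                 # (B, window, C)
    attn = torch.softmax(video_tokens @ keys.transpose(1, 2) / C ** 0.5, dim=-1)
    return video_tokens + attn @ keys              # residual cross-modal update
```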
Eliminating ghosting artifacts caused by moving objects is a challenging problem in high dynamic range (HDR) imaging. In this letter, we present a hybrid model consisting of a convolutional encoder and a Transformer decoder to generate ghost-free HDR images. In the encoder, a context aggregation network and a non-local attention block are adopted to optimize multi-scale features and capture both global and local dependencies of multiple low dynamic range (LDR) images. A decoder based on the Swin Transformer is utilized to improve the reconstruction capability of the proposed model. Motivated by the pronounced difference between regions with and without artifacts in the structure tensor (ST) field, we integrate the ST information of the LDR images as auxiliary inputs to the network and use an ST loss to further suppress artifacts. Unlike previous approaches, our network is capable of processing an arbitrary number of input LDR images. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method in comparison with existing state-of-the-art HDR deghosting models. Code is available at https://github.com/pandayuanyu/HSTHdr.
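For readers unfamiliar with the structure tensor, the sketch below computes a per-pixel ST from image gradients using the standard gradient outer-product formulation; the smoothing scale and the simple L1 discrepancy used as an "ST loss" here are assumptions, not the authors' implementation.

```python
# Illustrative structure tensor (ST) computation for a grayscale image (NumPy).
# Standard gradient-outer-product definition; the paper's exact smoothing kernel
# and loss weighting are not reproduced here.
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(img, sigma=1.5):
    """img: (H, W) float array. Returns the three ST components Jxx, Jxy, Jyy."""
    gy, gx = np.gradient(img)              # image gradients
    jxx = gaussian_filter(gx * gx, sigma)  # smoothed outer products
    jxy = gaussian_filter(gx * gy, sigma)
    jyy = gaussian_filter(gy * gy, sigma)
    return jxx, jxy, jyy

def st_l1_loss(img_a, img_b):
    """A simple L1 discrepancy between the ST fields of two images."""
    return sum(np.abs(a - b).mean()
               for a, b in zip(structure_tensor(img_a), structure_tensor(img_b)))
```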
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite their great success, we observe that existing VL models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find that one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense ability of VL models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as a data augmentation technique that injects commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage a commonsense knowledge graph (e.g., ConceptNet) and create variants of the text descriptions in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. Extensive experiments on representative VL models demonstrate that our DANCE technique significantly improves commonsense ability while maintaining performance on vanilla retrieval tasks. The code and data are available at https://github.com/pleaseconnectwifi/DANCE.
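A minimal sketch of what bidirectional sub-graph sequentialization could look like is given below; the relation templates are hypothetical examples in the spirit of ConceptNet, not the released DANCE code.

```python
# Illustrative knowledge-graph linearization for commonsense data augmentation.
# Relation templates are hypothetical examples in the spirit of ConceptNet.
TEMPLATES = {
    "HasProperty": ("{h} is {t}", "{t} is a property of {h}"),
    "IsA":         ("{h} is a kind of {t}", "{t} includes {h}"),
    "UsedFor":     ("{h} is used for {t}", "{t} is what {h} is used for"),
}

def linearize(triples):
    """Turn (head, relation, tail) triples into forward and backward sentences
    that can be appended to image captions on the fly during training."""
    sentences = []
    for head, rel, tail in triples:
        fwd, bwd = TEMPLATES[rel]
        sentences.append(fwd.format(h=head, t=tail))
        sentences.append(bwd.format(h=head, t=tail))
    return sentences

print(linearize([("lemon", "HasProperty", "sour")]))
# ['lemon is sour', 'sour is a property of lemon']
```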
The ability to associate touch with sight is essential for tasks that require physically interacting with objects in the world. We propose a dataset with paired visual and tactile data called Touch and Go, in which human data collectors probe objects in natural environments using tactile sensors, while simultaneously recording egocentric video. In contrast to previous efforts, which have largely been confined to lab settings or simulated environments, our dataset spans a large number of "in the wild" objects and scenes. To demonstrate our dataset's effectiveness, we successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) tactile-driven image stylization, i.e., making the visual appearance of an object more consistent with a given tactile signal, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.
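As a concrete, hypothetical illustration of the first task, self-supervised visuo-tactile feature learning can be posed as a contrastive objective that pulls together features of co-occurring video frames and tactile readings; the encoders and the exact loss used on Touch and Go are not assumed here.

```python
# Sketch of a contrastive (InfoNCE-style) visuo-tactile objective (PyTorch).
# Encoder architectures and the exact loss used on Touch and Go are assumptions.
import torch
import torch.nn.functional as F

def visuo_tactile_infonce(vision_feats, touch_feats, temperature=0.07):
    """vision_feats, touch_feats: (B, D) features of paired frames and tactile signals.
    Matching pairs along the batch diagonal are positives; all others are negatives."""
    v = F.normalize(vision_feats, dim=-1)
    t = F.normalize(touch_feats, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # i-th frame pairs with i-th touch
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```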
The ultimate goal of continuous sign language recognition (CSLR) is to facilitate communication between hearing-impaired and hearing people, which requires a certain degree of real-time performance and deployability from the model. However, previous research on CSLR has paid little attention to real-time performance and deployability. To improve both, this paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution, which gives 2D convolution strong spatial-temporal modelling capability with no additional parameters and a lower deployment cost than other spatial-temporal convolutions. The overall TSCM-based CSLR model is built on the improved ResBlockT network proposed in this paper: the "TSCM+2D convolution" hybrid convolution is applied to the ResBlock of the ResNet network to form the new ResBlockT, and random gradient stopping and a multi-level CTC loss are introduced to train the model, which reduces the final recognition WER while lowering training memory usage and extends the ResNet network from the image classification task to the video recognition task. In addition, this study is the first in CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition. Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method and achieve highly competitive results.
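The abstract does not spell out TSCM's internals; the sketch below only shows the general pattern of a zero-parameter temporal mixing step wrapped around a shared 2D convolution (in the spirit of temporal shift operations), and the actual crossover rule in the paper may differ.

```python
# Hypothetical sketch of a zero-parameter temporal mixing step combined with 2D
# convolution (PyTorch). The exact TSCM crossover rule is not reproduced here.
import torch
import torch.nn as nn

class TemporalMix2d(nn.Module):
    def __init__(self, channels, fold_div=8):
        super().__init__()
        self.fold = channels // fold_div        # channels exchanged across time
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        """x: (B, T, C, H, W). A slice of channels is superposed with the previous
        and next frames (zero extra parameters), then a shared 2D conv is applied."""
        b, t, c, h, w = x.shape
        f = self.fold
        out = x.clone()
        out[:, 1:, :f] = 0.5 * (x[:, 1:, :f] + x[:, :-1, :f])                   # mix with past frame
        out[:, :-1, f:2 * f] = 0.5 * (x[:, :-1, f:2 * f] + x[:, 1:, f:2 * f])   # mix with future frame
        return self.conv(out.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
```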
Remarkable progress has been made in self-supervised monocular depth estimation (SS-MDE) by exploring cross-view consistency, e.g., photometric consistency and 3D point cloud consistency. However, such methods are very vulnerable to illumination differences, occlusions, texture-less regions, and moving objects, making them not robust enough to handle various scenes. To address this challenge, we study two kinds of robust cross-view consistency in this paper. First, the spatial offset field between adjacent frames is obtained by reconstructing the reference frame from its neighbors via deformable alignment, which is used to align temporal depth features via a depth feature alignment (DFA) loss. Second, the 3D point clouds of each reference frame and its nearby frames are computed and transformed into voxel space, where the point density in each voxel is calculated and aligned via a voxel density alignment (VDA) loss. In this way, we exploit the temporal coherence in both the depth feature space and the 3D voxel space for SS-MDE, shifting the "point-to-point" alignment paradigm to a "region-to-region" one. Compared with the photometric consistency loss and the rigid point cloud alignment loss, the proposed DFA and VDA losses are more robust owing to the strong representation power of deep features and the high tolerance of voxel density to the aforementioned challenges. Experimental results on several outdoor benchmarks show that our method outperforms the current state of the art. Extensive ablation studies and analysis validate the effectiveness of the proposed losses, especially in challenging scenes. The code and models are available at https://github.com/sunnyhelen/rcvc-depth.
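A minimal sketch of the voxel density alignment idea is given below; the voxel size, grid extent, and exact loss form are illustrative assumptions, not the released rcvc-depth code.

```python
# Illustrative voxel density alignment (VDA) style loss (PyTorch).
# Point clouds are assumed to already be expressed in a common coordinate frame.
import torch

def voxel_density(points, grid_min, voxel_size, grid_dim=64):
    """points: (N, 3). Counts points per voxel of a cubic grid and normalizes."""
    idx = ((points - grid_min) / voxel_size).long().clamp(0, grid_dim - 1)
    flat = (idx[:, 0] * grid_dim + idx[:, 1]) * grid_dim + idx[:, 2]
    counts = torch.bincount(flat, minlength=grid_dim ** 3)
    return counts.float() / max(points.shape[0], 1)

def vda_loss(points_ref, points_src, grid_min, voxel_size=0.5, grid_dim=64):
    """L1 discrepancy between the voxel densities of two back-projected point clouds."""
    d_ref = voxel_density(points_ref, grid_min, voxel_size, grid_dim)
    d_src = voxel_density(points_src, grid_min, voxel_size, grid_dim)
    return (d_ref - d_src).abs().mean()
```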
Magnetic resonance fingerprinting (MRF) is a novel technique that simultaneously estimates multiple tissue-related parameters, such as the longitudinal relaxation time T1, the transverse relaxation time T2, the off-resonance frequency B0, and the proton density, from an object scanned in only about twenty seconds. However, the MRF method suffers from aliasing artifacts because it significantly undersamples the k-space data. In this work, we propose a compressed sensing (CS) framework for simultaneously estimating multiple tissue-related parameters based on the MRF method. It is more robust to low sampling ratios and is therefore more efficient in estimating MR parameters for all voxels of an object. Furthermore, the MRF method requires identifying the nearest query fingerprint atoms from the MR-signal-evolution dictionary using the L2 distance. However, we observe that the L2 distance is not always a suitable metric for measuring the similarity between MR fingerprints. Adaptively learning a distance metric from the undersampled training data can significantly improve the matching accuracy of the query fingerprints. Numerical results on extensive simulated cases show that our method substantially outperforms state-of-the-art methods in terms of the accuracy of parameter estimation.
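To make the dictionary-matching step concrete, the sketch below contrasts plain L2 matching with matching under a learned Mahalanobis-style metric; the matrix W stands in for whatever metric is learned from the undersampled training data and is an assumption, not the paper's algorithm.

```python
# Sketch of MRF dictionary matching with L2 distance vs. a learned metric (NumPy).
# D: dictionary of simulated signal evolutions, rows indexed by (T1, T2, B0, PD) atoms.
import numpy as np

def match_l2(query, D):
    """Return the index of the dictionary atom closest to `query` in L2 distance."""
    return np.argmin(np.linalg.norm(D - query, axis=1))

def match_learned_metric(query, D, W):
    """Match under a (hypothetical) learned metric d(x, y)^2 = (x - y)^T W (x - y),
    with W positive semi-definite, e.g. learned from undersampled training data."""
    diff = D - query                          # (num_atoms, signal_length)
    dist2 = np.einsum('ij,jk,ik->i', diff, W, diff)
    return np.argmin(dist2)
```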
Many sequential decision making problems can be formulated as adaptive submodular maximization problems. However, most existing studies in this field focus on the pool-based setting, where one can select items in any order, whereas in the stream-based setting items arrive in an arbitrary order and one must immediately decide whether or not to select an item upon its arrival. In this paper, we introduce a new class of utility functions, semi-policywise submodular functions. We develop a series of effective algorithms to maximize a semi-policywise submodular function under the stream-based setting.
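The abstract does not describe the algorithms themselves; as a point of reference only, a common pattern for stream-based selection under a cardinality constraint is a threshold rule on the marginal gain, sketched below with a user-supplied gain oracle.

```python
# Illustrative threshold-based streaming selection under a cardinality constraint.
# `marginal_gain(item, selected)` is a user-supplied oracle; the actual algorithms
# for semi-policywise submodular functions in the paper are not reproduced here.
def stream_select(stream, k, threshold, marginal_gain):
    selected = []
    for item in stream:                          # items arrive in arbitrary order
        if len(selected) >= k:
            break                                # budget exhausted
        if marginal_gain(item, selected) >= threshold:
            selected.append(item)                # irrevocable decision at arrival time
    return selected

# Example with a simple modular gain: pick up to 3 values of at least 5.
print(stream_select([2, 7, 1, 9, 6, 8], k=3, threshold=5,
                    marginal_gain=lambda x, S: x))
# [7, 9, 6]
```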
Many sequential decision making problems, including pool-based active learning and adaptive viral marketing, can be formulated as adaptive submodular maximization problems. Most studies on adaptive submodular optimization focus on either the monotone case or the non-monotone case. Specifically, if the utility function is monotone and adaptive submodular, \cite{golovin2011Adaptive} developed a greedy policy that achieves a $(1-1/e)$ approximation subject to a cardinality constraint. If the utility function is non-monotone and adaptive submodular, \cite{tang2021beyond} showed that a random greedy policy achieves a $1/e$ approximation ratio subject to a cardinality constraint. In this work, we aim to generalize the above results by studying the partial-monotone adaptive submodular maximization problem. To this end, we introduce the notion of an adaptive monotonicity ratio $m \in [0,1]$ to measure the degree of monotonicity of a function. Our main result shows that if the utility function is $m$-adaptive monotone and adaptive submodular, a greedy policy achieves an approximation ratio that interpolates between these two bounds; notably, when $m = 1$ and $m = 0$, this result recovers the aforementioned $(1-1/e)$ and $1/e$ approximation ratios, respectively. We further extend our results to consider a knapsack constraint and show that if the utility function is $m$-adaptive monotone and adaptive submodular, a sampling-based policy achieves an approximation ratio of $(m+1)/10$. One important implication of our results is that, even for a non-monotone utility function, we can still achieve an approximation ratio close to $(1-1/e)$ if this function is "close" to a monotone function. This leads to improved performance bounds for many machine learning applications whose utility functions are almost adaptive monotone.
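For a quick numerical sense of the guarantees quoted above (these values follow directly from the ratios stated in this abstract):

$$ m = 1:\; 1 - 1/e \approx 0.632, \qquad m = 0:\; 1/e \approx 0.368, \qquad \text{knapsack: } \frac{m+1}{10} \in [0.1,\, 0.2] \ \text{for } m \in [0,1]. $$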
The complexity-precision trade-off of an object detector is a key issue for resource-constrained vision tasks. Previous works have emphasized detectors built on efficient backbones. In this work, the impact of proposal processing by the detection head on this trade-off is investigated. It is hypothesized that improved detection efficiency requires a paradigm shift towards unequal proposal processing, assigning more computation to good proposals than to poor ones. This makes better use of the available computational budget, enabling higher accuracy for the same FLOPs. We formulate this as a learning problem where the goal is to assign operators to the proposals in the detection head so that the total computational cost is constrained and the precision is maximized. The key finding is that such a matching can be formulated as a function that maps each proposal embedding to a one-hot code over operators. While this function induces a complex dynamic network routing mechanism, it can be implemented by a simple MLP and learned end-to-end with off-the-shelf object detectors. This "dynamic proposal processing" (DPP) is shown to outperform state-of-the-art end-to-end object detectors (DETR, Sparse R-CNN) by a clear margin for a given computational complexity.
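A minimal sketch of the proposal-to-operator routing described above is given below, under our own assumptions about the interface: a small MLP gate maps each proposal embedding to a one-hot operator choice with a straight-through estimator, and the budget term is only indicated in a comment; the actual DPP selector differs in detail.

```python
# Hypothetical sketch of dynamic proposal processing: an MLP maps each proposal
# embedding to a one-hot code selecting one of several operators (PyTorch).
import torch
import torch.nn as nn

class ProposalRouter(nn.Module):
    def __init__(self, dim, operators):
        super().__init__()
        self.operators = nn.ModuleList(operators)   # e.g. heavy vs. light heads
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, len(operators)))

    def forward(self, proposals):
        """proposals: (N, dim). Each proposal is processed by exactly one operator."""
        logits = self.gate(proposals)               # (N, num_ops)
        soft = logits.softmax(dim=-1)
        hard = torch.zeros_like(soft).scatter_(-1, soft.argmax(-1, keepdim=True), 1.0)
        code = hard + soft - soft.detach()          # straight-through one-hot code
        # For clarity all operators are evaluated here; at inference only the
        # selected branch needs to run. A budget loss could penalize soft @ cost_vector.
        outs = torch.stack([op(proposals) for op in self.operators], dim=1)  # (N, num_ops, dim)
        return (code.unsqueeze(-1) * outs).sum(dim=1)
```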